Abstract

Modelling compositional meaning for sen- 
tences using empirical distributional methods 
has been a challenge for computational lin- 
guists. We implement the abstract categorical 
model of Coecke et al. (2010) using data from 
the BNC and evaluate it. The implementation 
is based on unsupervised learning of matrices 
for relational words and applying them to the 
vectors of their arguments. The evaluation is 
based on the word disambiguation task devel- 
oped by Mitchell and Lapata (2008) for intran- 
sitive sentences, and on a similar new experi- 
ment designed for transitive sentences. Our 
model matches the results of its competitors 
in the first experiment, and betters them in the 
second. The general improvement in results 
with increase in syntactic complexity show- 
cases the compositional power of our model. 

1 Introduction 

As competent language speakers, we humans can al- 
most trivially make sense of sentences we've never 
seen or heard before. We are naturally good at un- 
derstanding ambiguous words given a context, and 
forming the meaning of a sentence from the mean- 
ing of its parts. But while human beings seem 
comfortable doing this, machines fail to deliver. 
Search engines such as Google either fall back on 
bag of words models — ignoring syntax and lexical 
relations — or exploit superficial models of lexical 
semantics to retrieve pages with terms related to 
those in the query (Manning et al., 2008). 

However, such models fail to shine when it comes
to processing the semantics of phrases and sentences.
Discovering the process of meaning assignment in natural
language is among the most challenging and foundational
questions of linguistics and computer science. The findings
thereof will increase our understanding of cognition and
intelligence and shall assist in applications to automating
language-related tasks such as document search.

Compositional type-logical approaches (Montague,
1974; Lambek, 2008) and distributional models
of lexical semantics (Schütze, 1998; Firth, 1957)
have provided two partial, orthogonal solutions to the
question. Compositional formal semantic models
stem from classical ideas from mathematical logic, 
mainly Frege's principle that the meaning of a sen- 
tence is a function of the meaning of its parts (Frege, 
1892). Distributional models are more recent and 
can be related to Wittgenstein's later philosophy of 
'meaning is use', whereby meanings of words can be 
determined from their context (Wittgenstein, 1953). 
The logical models relate to well known and robust 
logical formalisms, hence offering a scalable theory 
of meaning which can be used to reason inferen- 
tially. The distributional models have found their 
way into real world applications such as thesaurus 
extraction (Grefenstette, 1994; Curran, 2004) or au- 
tomated essay marking (Landauer, 1997), and have 
connections to semantically motivated information 
retrieval (Manning et al., 2008). This two-sortedness 
of defining properties of meaning: 'logical form' 
versus 'contextual use', has left the quest for 'what is 
the foundational structure of meaning?' even more 
of a challenge. 

Recently, Coecke et al. (2010) used high level
cross-disciplinary techniques from logic, category
theory, and physics to bring the above two approaches
together. They developed a unified mathematical
framework whereby a sentence vector is by
definition a function of the Kronecker product of its
word vectors. A concrete instantiation of this theory
was exemplified on a toy hand crafted corpus
by Grefenstette et al. (2011). In this paper we implement
it by training the model over the entire BNC.
The highlight of our implementation is that words 
with relational types, such as verbs, adjectives, and 
adverbs are matrices that act on their arguments. We 
provide a general algorithm for building (or indeed 
learning) these matrices from the corpus. 

The implementation is evaluated against the task 
provided by Mitchell and Lapata (2008) for disam- 
biguating intransitive verbs, as well as a similar new 
experiment for transitive verbs. Our model improves 
on the best method evaluated in Mitchell and Lapata 
(2008) and offers promising results for the transitive 
case, demonstrating its scalability in comparison to 
that of other models. But we still feel there is need 
for a different class of experiments to showcase mer- 
its of compositionality in a statistically significant 
manner. Our work shows that the categorical com- 
positional distributional model of meaning permits 
a practical implementation and that this opens the 
way to the production of large scale compositional 
models. 

2 Two Orthogonal Semantic Models 

Formal Semantics To compute the meaning of a 
sentence consisting of n words, meanings of these 
words must interact with one another. In formal se- 
mantics, this further interaction is represented as a 
function derived from the grammatical structure of 
the sentence, but meanings of words are amorphous 
objects of the domain: no distinction is made be- 
tween words that have the same type. Such models 
consist of a pairing of syntactic interpretation rules 
(in the form of a grammar) with semantic interpretation
rules, as exemplified by the simple model presented
in Figure 1.

The parse of a sentence such as "cats like milk"
typically produces its semantic interpretation by
substituting semantic representations for their grammatical
constituents and applying β-reduction where
needed. Such a derivation is shown in Figure 2.



Syntactic Analysis          Semantic Interpretation
S  -> NP VP                 |VP|(|NP|)
NP -> cats, milk, etc.      |cats|, |milk|, ...
VP -> Vt NP                 |Vt|(|NP|)
Vt -> like, hug, etc.       λyx.|like|(x, y), ...

Figure 1: A simple model of formal semantics.

              |like|(|cats|, |milk|)
              /                  \
        |cats|            λx.|like|(x, |milk|)
                          /              \
                     |milk|        λyx.|like|(x, y)

Figure 2: A parse tree showing a semantic derivation.

This methodology is used to translate sentences 
of natural language into logical formulae, then use 
computer-aided automation tools to reason about 
them (Alshawi, 1992). One major drawback is that
the result of such analysis can only deal with truth
or falsity as the meaning of a sentence, and says
nothing about the closeness in meaning or topic of
expressions beyond their truth-conditions and the
models that satisfy them. Such models hence do not perform
well on language tasks such as search. Furthermore, an
underlying domain of objects and a valuation function
must be provided, as with any logic, leaving open 
the question of how we might learn the meaning of 
language using such a model, rather than just use it. 

Distributional Models Distributional models of
semantics, on the other hand, dismiss the interaction
between syntactically linked words and are solely
concerned with lexical semantics. Word meaning
is obtained empirically by examining the contexts [1]
in which a word appears, and equating the meaning
of a word with the distribution of contexts it shares.
The intuition is that context of use is what we appeal
to in learning the meaning of a word, and that
words that frequently have the same sort of context
in common are likely to be semantically related.

For instance, beer and sherry are both drinks, alcoholic,
and often cause a hangover. We expect
these facts to be reflected in a sufficiently large corpus:
the words 'beer' and 'sherry' occur within the
context of identifying words such as 'drink', 'alcoholic'
and 'hangover' more frequently than they occur
with other content words.

[1] E.g. words which appear in the same sentence or n-word
window, or words which hold particular grammatical or
dependency relations to the word being learned.

Such context distributions can be encoded as vectors
in a high dimensional space with contexts as
basis vectors. For any word vector $\overrightarrow{word}$, the scalar
weight $c_i^{word}$ associated with each context basis vector
$\overrightarrow{n_i}$ is a function of the number of times the
word has appeared in that context. Semantic vectors
$(c_1^{word}, c_2^{word}, \ldots, c_n^{word})$ are also denoted by sums
of such weight/basis vector pairs:

\[ \overrightarrow{word} = \sum_i c_i^{word} \, \overrightarrow{n_i} \]

Learning a semantic vector is just learning its basis
weights from the corpus. This setting offers geometric
means to reason about semantic similarity
(e.g. via cosine measure or k-means clustering), as
discussed in Widdows (2005).
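As an illustration of the geometric view, the following is a minimal sketch of the cosine measure over context-weight vectors. The context words and the counts are hypothetical values chosen for the example, not figures from the BNC, and returning 0 for an all-zero vector is a convention assumed here.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumed convention for empty (all-zero) vectors
    return dot / (norm_u * norm_v)

# Hypothetical weights over the contexts
# ('drink', 'alcoholic', 'hangover', 'engine').
beer = [12.0, 7.0, 3.0, 0.0]
sherry = [9.0, 6.0, 2.0, 0.0]
car = [1.0, 0.0, 0.0, 15.0]

sim_drinks = cosine(beer, sherry)  # words sharing most contexts
sim_mixed = cosine(beer, car)      # words sharing few contexts
```

With these toy weights, `sim_drinks` comes out much larger than `sim_mixed`, which is the behaviour the beer/sherry example describes.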

The principal drawback of such models is their
non-compositional nature: they ignore grammatical
structure and logical words, and hence cannot compute
the meanings of phrases and sentences in the
same efficient way that they do for words. Common
operations discussed in (Mitchell and Lapata,
2008) such as vector addition ($+$) and component-wise
multiplication ($\odot$, cf. §4 for details) are commutative,
hence if $\overrightarrow{vw} = \overrightarrow{v} + \overrightarrow{w}$ or
$\overrightarrow{v} \odot \overrightarrow{w}$, then $\overrightarrow{vw} = \overrightarrow{wv}$,
leading to unwelcome equalities such as

\[ \overrightarrow{\text{the dog bit the man}} = \overrightarrow{\text{the man bit the dog}} \]

Non-commutative operations, such as the Kronecker 
product (cf. §4 for definition) can take word-order 
into account (Smolensky, 1990) or even some more 
complex syntactic relations, as described in Clark 
and Pulman (2007). However, the dimensionality of 
sentence vectors produced in this manner differs for 
sentences of different length, barring all sentences 
from being compared in the same vector space, and 
growing exponentially with sentence length hence 
quickly becoming computationally intractable. 
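The contrast between commutative and non-commutative composition can be checked on a small example. The sketch below uses hypothetical three-dimensional toy vectors; the helper names `add`, `pointwise` and `kron` are introduced here for illustration only.

```python
def add(u, v):
    """Vector addition."""
    return [a + b for a, b in zip(u, v)]

def pointwise(u, v):
    """Component-wise multiplication."""
    return [a * b for a, b in zip(u, v)]

def kron(u, v):
    """Kronecker product of two vectors, flattened row-major."""
    return [a * b for a in u for b in v]

# Hypothetical toy vectors over a 3-word basis.
dog = [1.0, 2.0, 0.0]
man = [0.0, 3.0, 4.0]

same_sum = add(dog, man) == add(man, dog)                # order is lost
same_product = pointwise(dog, man) == pointwise(man, dog)  # order is lost
same_kron = kron(dog, man) == kron(man, dog)             # order is kept

# Each further word multiplies the dimension by r = 3.
three_words = kron(kron(dog, man), dog)
```

Here `same_sum` and `same_product` are true while `same_kron` is false, and `three_words` already has 27 components, which illustrates the exponential growth mentioned above.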

3 A Hybrid Logico-Distributional Model 

Whereas semantic compositional mechanisms for
set-theoretic constructions are well understood,
there are no obvious corresponding methods for vector
spaces. To solve this problem, Coecke et al.
(2010) use the abstract setting of category theory to
turn the grammatical structure of a sentence into a
morphism compatible with the higher level logical
structure of vector spaces.

One pragmatic consequence of this abstract idea
is as follows. In distributional models, there is a
meaning vector for each word, e.g. $\overrightarrow{cats}$, $\overrightarrow{like}$, and
$\overrightarrow{milk}$. The logical recipe tells us to apply the meaning
of the verb to the meanings of subject and object.
But how can a vector apply to other vectors? The so- 
lution proposed above implies that one needs to have 
different levels of meaning for words with different 
types. This is similar to logical models where verbs 
are relations and nouns are atomic sets. So verb vec- 
tors should be built differently from noun vectors, 
for instance as matrices. 

The general information as to which words should 
be matrices and which words atomic vectors is in 
fact encoded in the type-logical representation of the 
grammatical structure of the sentence. This is the 
linear map with word vectors as input and sentence 
vectors as output. Hence, at least theoretically, one 
should be able to build sentence vectors and com- 
pare their synonymity in exactly the same way as 
one measures word synonymity. 

Pregroup Grammars The aforementioned linear
maps turn out to be the grammatical reductions
of a type-logic called a Lambek pregroup grammar
(Lambek, 2008) [2]. Pregroups and vector spaces
share the same high level mathematical structure,
referred to as a compact closed category. For a proof
and details of this claim see Coecke et al. (2010); for
a friendly introduction to category theory, see Coecke
and Paquette (2011). One consequence of this
parity is that the grammatical reductions of a pregroup
grammar can be directly transformed into linear
maps that act on vectors.

In a nutshell, pregroup types are either atomic
or compound. Atomic types can be simple (e.g. $n$
for noun phrases, $s$ for statements) or left/right
superscripted, referred to as adjoint types (e.g. $n^r$
and $n^l$). An example of a compound type is that of
a verb $n^r s n^l$. The superscripted types express that
the verb is a relation with two arguments of type $n$,
which have to occur to the right and to the left of
it, and that it outputs an argument of the type $s$. A
transitive sentence has types as shown in Figure 3.

[2] The usage of pregroup types is not essential; the types of
any other logic, for instance CCG, can be used, but should be
translated into the language of pregroups.

Each type $n$ cancels out with its right adjoint $n^r$
from the right and its left adjoint $n^l$ from the left;
mathematically speaking these mean [3]

\[ n^l n \leq 1 \quad \text{and} \quad n n^r \leq 1 \]

Here 1 is the unit of concatenation: $1n = n1 = n$.
The corresponding grammatical reduction of a
transitive sentence is $n \, n^r s n^l \, n \leq 1 s 1 = s$. Each such
reduction can be depicted as a wire diagram. The
diagram of a transitive sentence is shown in Figure 3.

    Cats       like        milk.
     n      n^r  s  n^l      n
     |_______|   |   |_______|

Figure 3: The pregroup types and reduction diagram for
a transitive sentence.
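The cancellation of adjacent types can be mimicked with a small stack-based routine. This is only a sketch under stated assumptions: the encoding of a type as a `(base, adjoint)` pair (with -1 for a left adjoint, 0 for a simple type and +1 for a right adjoint) and the greedy left-to-right strategy are choices made here for illustration, and a greedy pass is not a complete pregroup parser in general.

```python
def reduce_types(types):
    """Greedily cancel adjacent pairs x^(k) x^(k+1) <= 1.

    Each type is a (base, adjoint) pair; adjoint is -1 for a left
    adjoint, 0 for a simple type and +1 for a right adjoint, so that
    n^l n and n n^r are both cancellable pairs.
    """
    stack = []
    for base, adj in types:
        if stack and stack[-1][0] == base and stack[-1][1] + 1 == adj:
            stack.pop()  # the pair contracts to the unit 1
        else:
            stack.append((base, adj))
    return stack

NOUN = ("n", 0)
TRANSITIVE_VERB = [("n", 1), ("s", 0), ("n", -1)]  # n^r s n^l

# "Cats like milk": n (n^r s n^l) n
cats_like_milk = [NOUN] + TRANSITIVE_VERB + [NOUN]
reduced = reduce_types(cats_like_milk)
```

For the transitive sentence, `reduced` contains the single type `("s", 0)`, matching the reduction to $s$ depicted in Figure 3.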

Syntax-guided Semantic Composition Accord- 
ing to Coecke et al. (2010) and based on a general 
completeness theorem between compact categories, 
wire diagrams, and vector spaces, the meaning of 
sentences can be canonically reduced to linear alge- 
braic formulae. The following is the meaning vector 
of our transitive sentence: 

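\[ \overrightarrow{\text{cats like milk}} = (f)\left(\overrightarrow{cats} \otimes \overrightarrow{like} \otimes \overrightarrow{milk}\right) \qquad \text{(I)} \]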

Here $f$ is the linear map that encodes the grammatical
structure. The categorical morphism corresponding
to it is denoted by the tensor product of 3 components:
$\epsilon_V \otimes 1_S \otimes \epsilon_W$, where $V$ and $W$ are subject and
object spaces, $S$ is the sentence space, the $\epsilon$'s are the
cups, and $1_S$ is the straight line in the diagram. The
cups stand for taking inner products, which when
done with the basis vectors imitate substitution. The
straight line stands for the identity map that does
nothing. By the rules of the category, equation (I)
reduces to the following linear algebraic formula with
lower dimensions, hence the dimensional explosion
problem for Kronecker products is avoided:

\[ \sum_{itj} c_{itj} \, \langle \overrightarrow{cats} \mid \overrightarrow{v_i} \rangle \, \overrightarrow{s_t} \, \langle \overrightarrow{w_j} \mid \overrightarrow{milk} \rangle \in S \qquad \text{(II)} \]

$\overrightarrow{v_i}, \overrightarrow{w_j}$ are basis vectors of $V$ and $W$. The inner
product $\langle \overrightarrow{cats} \mid \overrightarrow{v_i} \rangle$ substitutes the weights of $\overrightarrow{cats}$
into the first argument place of the verb (similarly
for object and second argument place). $\overrightarrow{s_t}$ is a basis
vector of the sentence space $S$ in which meanings of
sentences live, regardless of their grammatical structure.

[3] The relation $\leq$ is the partial order of the pregroup. It
corresponds to implication $\Rightarrow$ in a logical reading thereof. If these
inequalities are replaced by equalities, i.e. if $n^l n = 1 = n n^r$,
then the pregroup collapses into a group where $n^l = n^r$.

The degree of synonymity of sentences is ob- 
tained by taking the cosine measure of their vectors. 
S is an abstract space: it needs to be instantiated
to provide concrete meanings and synonymity measures.
For instance, a truth-theoretic model is obtained
by taking the sentence space S to be the 2-dimensional
space with basis vectors $|1\rangle$ (True) and
$|0\rangle$ (False).

4 Building Matrices for Relational Words 

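In this section we present a general scheme to build
matrices for relational words. Recall that given
a vector space $A$ with basis $\{\overrightarrow{n_i}\}_i$, the Kronecker
product of two vectors $\overrightarrow{v} = \sum_i c_i^a \overrightarrow{n_i}$ and
$\overrightarrow{w} = \sum_i c_i^b \overrightarrow{n_i}$ is defined as follows:

\[ \overrightarrow{v} \otimes \overrightarrow{w} = \sum_{ij} c_i^a c_j^b \, (\overrightarrow{n_i} \otimes \overrightarrow{n_j}) \]

where $(\overrightarrow{n_i} \otimes \overrightarrow{n_j})$ is just the pairing of the basis of $A$,
i.e. $(\overrightarrow{n_i}, \overrightarrow{n_j})$. The Kronecker product vectors belong
in the tensor product of $A$ with itself: $A \otimes A$, hence
if $A$ has dimension $r$, these will be of dimensionality
$r \times r$. The point-wise multiplication of these vectors
is defined as follows:

\[ \overrightarrow{v} \odot \overrightarrow{w} = \sum_i c_i^a c_i^b \, \overrightarrow{n_i} \]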

The intuition behind having a matrix for a relational
word is that any relation $R$ on sets $X$ and $Y$,
i.e. $R \subseteq X \times Y$, can be represented as a matrix,
namely one that has as row-bases $x \in X$ and as
column-bases $y \in Y$, with weight $c_{xy} = 1$ where
$(x, y) \in R$ and 0 otherwise. In a distributional setting,
the weights, which are natural or real numbers,
will represent more: 'the extent according to which
$x$ and $y$ are related'. This can be determined in different
ways.

Suppose $X$ is the set of animals, and 'chase' is a
relation on it: $chase \subseteq X \times X$. Take $x$ = 'dog'
and $y$ = 'cat': with our type-logical glasses on, the
obvious choice would be to take $c_{xy}$ to be the number
of times 'dog' has chased 'cat', i.e. the number
of times the sentence 'the dog chases the cat' has
appeared in the corpus. But in the distributional setting,
this method will be too syntactic and dismissive
of the actual meaning of 'cat' and 'dog'. If instead
the corpus contains the sentence 'the hound hunted
the wild cat', $c_{xy}$ will be 0, restricting us to only
assign meaning to sentences that have directly appeared
in the corpus. We propose to, instead, use a
level of abstraction by taking words such as verbs to
be distributions over the semantic information in the
vectors of their context words, rather than over the
context words themselves.

Start with an $r$-dimensional vector space $N$ with
basis $\{\overrightarrow{n_i}\}_i$, in which meaning vectors of atomic
words, such as nouns, live. The basis vectors of $N$
are in principle all the words from the corpus; however,
in practice and following Mitchell and Lapata
(2008) we had to restrict these to a subset of the
most frequently occurring words. These basis vectors are not
restricted to nouns: they can as well be verbs, adjec- 
tives, and adverbs, so that we can define the mean- 
ing of a noun in all possible contexts — as is usual 
in context-based models — and not only in the con- 
text of other nouns. Note that basis words with re- 
lational types are treated as pure lexical items rather 
than as semantic objects represented as matrices. In 
short, we count how many times a noun has occurred 
close to words of other syntactic types such as 'elect' 
and 'scientific', rather than count how many times it 
has occurred close to their corresponding matrices: 
it is the lexical tokens that form the context, not their 
meaning. 

Each relational word 'P' with grammatical type $\pi$
and $m$ adjoint types $\alpha_1, \alpha_2, \ldots, \alpha_m$ is encoded as
an $(r \times \ldots \times r)$ matrix with $m$ dimensions. Since
our vector space $N$ has a fixed basis, each such matrix
is represented in vector form as follows:

\[ \overrightarrow{P} = \sum_{ij \cdots \zeta} c_{ij \cdots \zeta} \, (\overrightarrow{n_i} \otimes \overrightarrow{n_j} \otimes \cdots \otimes \overrightarrow{n_\zeta}) \]

where the sum ranges over $m$ indices. This vector lives in the
tensor space $N \otimes N \otimes \cdots \otimes N$ ($m$ times). Each $c_{ij \cdots \zeta}$ is computed
according to the procedure described in Figure 4.

1) Consider a sequence of words containing a relational
word 'P' and its arguments $w_1, w_2, \ldots, w_m$,
occurring in the same order as described in
P's grammatical type $\pi$. Refer to these sequences
as 'P'-relations. Suppose there are $k$ of them.

2) Retrieve the vector $\overrightarrow{w_l}$ of each argument $w_l$.

3) Suppose $w_1$ has weight $c_i^1$ on basis vector $\overrightarrow{n_i}$,
$w_2$ has weight $c_j^2$ on basis vector $\overrightarrow{n_j}$, ..., and
$w_m$ has weight $c_\zeta^m$ on basis vector $\overrightarrow{n_\zeta}$. Multiply
these weights:

\[ c_i^1 \times c_j^2 \times \cdots \times c_\zeta^m \]

4) Repeat the above steps for all the $k$ 'P'-relations,
and sum [a] the corresponding weights:

\[ c_{ij \cdots \zeta} = \sum_k \left( c_i^1 \times c_j^2 \times \cdots \times c_\zeta^m \right)_k \]

[a] We also experimented with multiplication, but the sparsity
of noun vectors resulted in most verb matrices being
empty.

Figure 4: Procedure for learning weights for matrices of
words 'P' with relational types $\pi$ of $m$ arguments.

Linear algebraically, this procedure corresponds to
computing the following:

\[ \overrightarrow{P} = \sum_k \left( \overrightarrow{w_1} \otimes \overrightarrow{w_2} \otimes \cdots \otimes \overrightarrow{w_m} \right)_k \]

Type-logical examples of relational words are
verbs, adjectives, and adverbs. A transitive verb is
represented as a 2 dimensional matrix since its type
is $n^r s n^l$ with two adjoint types $n^r$ and $n^l$. The
corresponding vector of this matrix is

\[ \overrightarrow{verb} = \sum_{ij} c_{ij} \, (\overrightarrow{n_i} \otimes \overrightarrow{n_j}) \]
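The procedure of Figure 4 can be sketched in a few lines. The function name `learn_relational_tensor`, the sparse dictionary representation and the toy argument vectors below are assumptions made for illustration; the sketch enumerates all $r^m$ index tuples per relation and so is only practical for small $r$ and $m$.

```python
from itertools import product

def learn_relational_tensor(relations, r):
    """Sum, over all observed relations, the Kronecker product of
    the argument vectors.

    `relations` is a list of tuples of m argument vectors, each of
    length r. The result maps an index tuple (i, j, ...) to its
    accumulated weight; zero contributions are skipped.
    """
    tensor = {}
    for args in relations:
        for idx in product(range(r), repeat=len(args)):
            weight = 1.0
            for arg, i in zip(args, idx):
                weight *= arg[i]  # step 3: multiply the weights
            if weight != 0.0:
                # step 4: sum over all relations
                tensor[idx] = tensor.get(idx, 0.0) + weight
    return tensor

# Two hypothetical 'P'-relations with m = 2 arguments and r = 2.
toy_relations = [
    ([1.0, 2.0], [3.0, 0.0]),
    ([0.0, 1.0], [1.0, 1.0]),
]
toy_tensor = learn_relational_tensor(toy_relations, r=2)
```

A dense $r \times \ldots \times r$ array would serve equally well; the dictionary is used here only because the sparsity of noun vectors leaves many entries at zero.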



The weight $c_{ij}$ corresponding to basis vector $\overrightarrow{n_i} \otimes
\overrightarrow{n_j}$ is the extent according to which words that have
co-occurred with $\overrightarrow{n_i}$ have been the subject of the
'verb' and words that have co-occurred with $\overrightarrow{n_j}$
have been the object of the 'verb'. This example
computation is demonstrated in Figure 5.

Figure 5: Procedure for learning weights for matrices of
transitive verbs.

Linear algebraically, we are computing

\[ \overrightarrow{verb} = \sum_k \left( \overrightarrow{w_1} \otimes \overrightarrow{w_2} \right)_k \]

As an example, consider the verb 'show' and suppose
there are two 'show'-relations in the corpus:

$s_1$ = table show result

$s_2$ = map show location

The vector of 'show' is

\[ \overrightarrow{show} = \overrightarrow{table} \otimes \overrightarrow{result} + \overrightarrow{map} \otimes \overrightarrow{location} \]

Consider an $N$ space with four basis vectors 'far',
'room', 'scientific', and 'elect'. The TF/IDF-weighted
values for vectors of the above four nouns
(built from the BNC) are as shown in Table 1.

i   n_i          table   map    result   location
1   far          6.6     5.6    7        5.9
2   room         27      7.4    0.99     7.3
3   scientific   0       5.4    13       6.1
4   elect        0       0      4.2      0

Table 1: Sample weights for selected noun vectors.

Part of the matrix of 'show' is presented in Table 2.

As a sample computation, the weight $c_{11}$ for
vector (1,1), i.e. (far, far), is computed by multiplying
weights of 'table' and 'result' on far, i.e. 6.6 × 7,
multiplying weights of 'map' and 'location' on far,
i.e. 5.6 × 5.9, then adding these 46.2 + 33.04 and
obtaining the total weight 79.24.

             far      room    scientific   elect
far          79.24    47.41   119.96       27.72
room         232.66   80.75   396.14       113.2
scientific   32.94    31.86   32.94        0
elect        0        0       0            0

Table 2: Sample semantic matrix for 'show'.
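The sample computation can be replayed directly from the noun weights. In the sketch below the weights are those of Table 1, with the cells that appear blank in the table taken to be 0 (an assumption); the helper names `outer` and `add_matrices` are introduced here for illustration.

```python
BASIS = ["far", "room", "scientific", "elect"]

# Noun weights from Table 1; blank cells are assumed to be 0.
NOUNS = {
    "table":    [6.6, 27.0, 0.0, 0.0],
    "map":      [5.6, 7.4, 5.4, 0.0],
    "result":   [7.0, 0.99, 13.0, 4.2],
    "location": [5.9, 7.3, 6.1, 0.0],
}

def outer(u, v):
    """Kronecker product of two vectors, kept as an r x r matrix."""
    return [[a * b for b in v] for a in u]

def add_matrices(a, b):
    """Entry-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(row_a, row_b)]
            for row_a, row_b in zip(a, b)]

# show = table (x) result + map (x) location
show = add_matrices(
    outer(NOUNS["table"], NOUNS["result"]),
    outer(NOUNS["map"], NOUNS["location"]),
)

far_far = show[BASIS.index("far")][BASIS.index("far")]
```

Up to floating-point rounding, `far_far` evaluates to 79.24, i.e. 6.6 × 7 + 5.6 × 5.9, in agreement with the sample computation.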

The same method is applied to build matrices for di- 
transitive verbs, which will have 3 dimensions, and 
adjectives and adverbs, which will be of 1 dimension 
each. 

5 Computing Sentence Vectors 

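Meanings of sentences are vectors computed by taking
the variables of the categorical prescription of
meaning (the linear map $f$ obtained from the grammatical
reduction of the sentence) to be determined
by the matrices of the relational words. For instance
the meaning of the transitive sentence 'sub verb obj'
is:

\[ \overrightarrow{\text{sub verb obj}} = \sum_{itj} \langle \overrightarrow{sub} \mid \overrightarrow{v_i} \rangle \langle \overrightarrow{w_j} \mid \overrightarrow{obj} \rangle \, c_{itj} \, \overrightarrow{s_t} \]

We take $V := W := N$ and $S = N \otimes N$, then
$\sum_{itj} c_{itj} \overrightarrow{s_t}$ is determined by the matrix of the verb,
i.e. substitute it by $\sum_{ij} c_{ij} (\overrightarrow{n_i} \otimes \overrightarrow{n_j})$ [4]. Hence
$\overrightarrow{\text{sub verb obj}}$ becomes:

\[ \sum_{ij} \langle \overrightarrow{sub} \mid \overrightarrow{n_i} \rangle \langle \overrightarrow{n_j} \mid \overrightarrow{obj} \rangle \, c_{ij} \, (\overrightarrow{n_i} \otimes \overrightarrow{n_j}) = \sum_{ij} c_i^{sub} c_j^{obj} c_{ij} \, (\overrightarrow{n_i} \otimes \overrightarrow{n_j}) \]

This can be decomposed to point-wise multiplication
of two vectors as follows:

\[ \left( \sum_{ij} c_i^{sub} c_j^{obj} \, (\overrightarrow{n_i} \otimes \overrightarrow{n_j}) \right) \odot \left( \sum_{ij} c_{ij} \, (\overrightarrow{n_i} \otimes \overrightarrow{n_j}) \right) \]

The left argument is the Kronecker product of subject
and object vectors and the right argument is the
vector of the verb, so we obtain

\[ \left( \overrightarrow{sub} \otimes \overrightarrow{obj} \right) \odot \overrightarrow{verb} \]

Since $\odot$ is commutative, this provides us with a
distributional version of the type-logical meaning of the
sentence: point-wise multiplication of the meaning
of the verb to the Kronecker product of its subject
and object:

\[ \overrightarrow{\text{sub verb obj}} = \overrightarrow{verb} \odot \left( \overrightarrow{sub} \otimes \overrightarrow{obj} \right) \]

[4] Note that by doing so we are also reducing the verb space
from $N \otimes (N \otimes N) \otimes N$ to $N \otimes N$, since for our construction
we only need tuples of the form $\overrightarrow{n_i} \otimes \overrightarrow{n_i} \otimes \overrightarrow{n_j} \otimes \overrightarrow{n_j}$ which
are isomorphic to pairs $(\overrightarrow{n_i} \otimes \overrightarrow{n_j})$.

Figure 5 (procedure):

1) Consider phrases containing 'verb', its subject
$w_1$ and object $w_2$. Suppose there are $k$ of them.

2) Retrieve vectors $\overrightarrow{w_1}$ and $\overrightarrow{w_2}$.

3) Suppose $\overrightarrow{w_1}$ has weight $c_i^1$ on $\overrightarrow{n_i}$ and $\overrightarrow{w_2}$ has
weight $c_j^2$ on $\overrightarrow{n_j}$. Multiply these weights: $c_i^1 \times c_j^2$.

4) Repeat the above steps for all $k$ 'verb'-relations
and sum the corresponding weights:
$\sum_k (c_i^1 \times c_j^2)_k$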



This mathematical operation can be informally de- 
scribed as a structured 'mixing' of the information 
of the subject and object, followed by it being 'fil- 
tered' through the information of the verb applied 
to them, in order to produce the information of the 
sentence. 
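The 'mixing' and 'filtering' steps can be made concrete with a small sketch of $\overrightarrow{verb} \odot (\overrightarrow{sub} \otimes \overrightarrow{obj})$. The verb matrix is stored flattened row-major, and all numeric values below are hypothetical toy weights for $r = 2$; the function name `transitive_sentence` is an illustrative choice.

```python
def kron(u, v):
    """Kronecker product of two vectors, flattened row-major."""
    return [a * b for a in u for b in v]

def pointwise(u, v):
    """Component-wise multiplication of two equal-length vectors."""
    return [a * b for a, b in zip(u, v)]

def transitive_sentence(verb, sub, obj):
    """Compose a sentence vector as verb (.) (sub (x) obj).

    `verb` is the flattened r*r verb matrix; `sub` and `obj` are
    noun vectors of length r.
    """
    mixed = kron(sub, obj)         # structured 'mixing'
    return pointwise(verb, mixed)  # 'filtering' through the verb

# Hypothetical toy weights with r = 2.
sub = [1.0, 2.0]
obj = [3.0, 1.0]
verb = [0.5, 0.0, 1.0, 2.0]

sentence = transitive_sentence(verb, sub, obj)
swapped = transitive_sentence(verb, obj, sub)
```

Swapping subject and object changes the result (`sentence` and `swapped` differ), so word order is reflected in the sentence vector even though $\odot$ itself is commutative.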

In the transitive case, $S = N \otimes N$, hence $\overrightarrow{s_t} =
\overrightarrow{n_i} \otimes \overrightarrow{n_j}$. More generally, the vector space corresponding
to the abstract sentence space $S$ is the
concrete tensor space $(N \otimes \ldots \otimes N)$ for $m$ the dimension
of the matrix of the 'verb'. As we have
seen above, in practice we do not need to build this
tensor space, as the computations thereof reduce to
point-wise multiplications and summations.

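Similar computations yield meanings of sentences
with adjectives and adverbs. For instance, for the meaning
of a transitive sentence with a modified subject
and a modified verb we have

\[ \overrightarrow{\text{adj sub verb obj adv}} = \left( \overrightarrow{adv} \odot \overrightarrow{verb} \right) \odot \left( \left( \overrightarrow{adj} \odot \overrightarrow{sub} \right) \otimes \overrightarrow{obj} \right) \]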

After building vectors for sentences, we can com- 
pare their meaning and measure their degree of syn- 
onymy by taking their cosine measure. 

6 Evaluation 

Evaluating such a framework is no easy task. What
to evaluate depends heavily on what sort of application
a practical instantiation of the model is geared
towards. In (Grefenstette et al., 2011), it is suggested
that the simplified model we presented and
expanded here could be evaluated in the same way as
lexical semantic models, measuring compositionally
built sentence vectors against a benchmark dataset
such as that provided by Mitchell and Lapata (2008).
In this section, we briefly describe the evaluation of
our model against this dataset. Following this, we
present a new evaluation task extending the experimental
methodology of Mitchell and Lapata (2008)
to transitive verb-centric sentences, and compare our
model to those discussed by Mitchell and Lapata
(2008) within this new experiment.

First Dataset Description The first experiment, 
described in detail by Mitchell and Lapata (2008), 
evaluates how well compositional models disam- 
biguate ambiguous words given the context of a po- 
tentially disambiguating noun. Each entry of the 
dataset provides a noun, a target verb and landmark 
verb (both intransitive). The noun must be com- 
posed with both verbs to produce short phrase vec- 
tors the similarity of which is measured by the can- 
didate. Also provided with each entry is a classifi- 
cation ("High" or "Low") indicating whether or not 
the verbs are indeed semantically close within the 
context of the noun, as well as an evaluator-set simi- 
larity score between 1 and 7 (along with an evaluator 
identifier), where 1 is low similarity and 7 is high. 

Evaluation Methodology Candidate models provide
a similarity score for each entry. The scores
of high similarity entries and low similarity entries
are averaged to produce a mean High score and
mean Low score for the model. The correlation of
the model's similarity judgements with the human
judgements is also calculated using Spearman's $\rho$, a
metric which is deemed to be more scrupulous, and
ultimately that by which models should be ranked,
by Mitchell and Lapata (2008). The mean for each
model is on a [0, 1] scale, except for UpperBound
which is on the same [1, 7] scale the annotators used.
The $\rho$ scores are on a [-1, 1] scale. It is assumed
that inter-annotator agreement provides the theoretical
maximum $\rho$ for any model for this experiment.
The cosine measure of the verb vectors, ignoring the
noun, is taken to be the baseline (no composition).
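The two reported quantities, the mean High/Low scores and Spearman's $\rho$, can be computed as sketched below. The scores and labels are hypothetical toy data rather than entries from the dataset, and computing $\rho$ as the Pearson correlation of average ranks is one standard formulation assumed here.

```python
def ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[pos]]):
            end += 1
        mean_rank = (pos + end) / 2.0 + 1.0
        for k in range(pos, end + 1):
            result[order[k]] = mean_rank
        pos = end + 1
    return result

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman's rho as the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def mean_by_class(scores, labels, label):
    """Mean model score over the entries carrying `label`."""
    picked = [s for s, l in zip(scores, labels) if l == label]
    return sum(picked) / len(picked)

# Hypothetical model scores, annotator scores and classifications.
model_scores = [0.9, 0.8, 0.3, 0.4]
human_scores = [6, 7, 2, 1]
labels = ["High", "High", "Low", "Low"]

rho = spearman(model_scores, human_scores)
mean_high = mean_by_class(model_scores, labels, "High")
mean_low = mean_by_class(model_scores, labels, "Low")
```

Because $\rho$ depends only on the ordering of the scores, it is unaffected by any monotone rescaling of a model's similarity values, unlike the gap between the two means.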

Other Models The other models we compare
ours to are those evaluated by Mitchell and Lapata
(2008). We provide a selection of the results
from that paper for the worst (Add) and best [5] (Multiply)
performing models, as well as the previous
second-best performing model (Kintsch). The additive
and multiplicative models are simply applications
of vector addition and component-wise multiplication.
We invite the reader to consult (Mitchell
and Lapata, 2008) for the description of Kintsch's
additive model and parametric choices.

Model Parameters To provide the most accurate 
comparison with the existing multiplicative model, 
and exploiting the aforementioned feature that the 
categorical model can be built "on top of" existing 
lexical distributional models, we used the parame- 
ters described by Mitchell and Lapata (2008) to re- 
produce the vectors evaluated in the original exper- 
iment as our noun vectors. All vectors were built 
from a lemmatised version of the BNC. The noun 
basis was the 2000 most common context words, 
basis weights were the probability of context words 
given the target word divided by the overall proba- 
bility of the context word. Intransitive verb function- 
vectors were trained using the procedure presented 
in §4. Since the dataset only contains intransitive 
verbs and nouns, we used S = N. The cosine mea- 
sure of vectors was used as a similarity metric. 
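The basis weighting described above, the probability of a context word given the target word divided by the overall probability of the context word, can be sketched as follows. Estimating both probabilities as relative frequencies from raw counts is an assumption made here, as are the function name `ratio_weights` and the toy counts.

```python
def ratio_weights(cooc, context_totals, grand_total):
    """Return p(c | t) / p(c) for each context word c.

    cooc[i]           : times context word i occurs near the target
    context_totals[i] : times context word i occurs in the corpus
    grand_total       : total number of word occurrences counted
    Probabilities are estimated as relative frequencies (assumed).
    """
    target_total = sum(cooc)
    weights = []
    for n_tc, n_c in zip(cooc, context_totals):
        p_c_given_t = n_tc / target_total if target_total else 0.0
        p_c = n_c / grand_total
        weights.append(p_c_given_t / p_c if p_c else 0.0)
    return weights

# Hypothetical counts for one target word and three context words.
toy_weights = ratio_weights(
    cooc=[8, 2, 0],
    context_totals=[100, 50, 50],
    grand_total=200,
)
```

A weight above 1 marks a context word that is more frequent near the target than in the corpus overall; a weight below 1 marks one that is less frequent.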

First Experiment Results In Table 3 we present
the comparison of the selected models. Our categorical
model performs significantly better than the existing
second-place (Kintsch) and obtains a $\rho$ quasi-identical
to the multiplicative model, indicating significant
correlation with the annotator scores.

There is not a large difference between the mean 
High score and mean Low score, but the distri- 
bution in Figure 6 shows that our model makes a 
non-negligible distinction between high similarity 
phrases and low similarity phrases, despite the ab- 
solute scores not being different by more than a few 
percentiles. 

[5] The multiplicative model presented here is what is qualified
as best in (Mitchell and Lapata, 2008). However, they also
present a slightly better performing ($\rho = 0.19$) model which
is a combination of their multiplicative model and a weighted
additive model. The difference in $\rho$ is qualified as "not
statistically significant" in the original paper, and furthermore the
mixed model requires parametric optimisation hence was not
evaluated against the entire test set. For these reasons, we chose
not to include it in the comparison.



Model         High   Low    ρ
Baseline      0.27   0.26   0.08
Add           0.59   0.59   0.04
Kintsch       0.47   0.45   0.09
Multiply      0.42   0.28   0.17
Categorical   0.84   0.79   0.17
UpperBound    4.94   3.25   0.40

Table 3: Selected model means for High and Low similarity
items and correlation coefficients with human judgements,
first experiment (Mitchell and Lapata, 2008). p <
0.05 for each ρ.
Figure 6: Distribution of predicted similarities for the categorical
distributional model on High and Low similarity
items.

Second Dataset Description The second dataset [6],
developed by the authors, follows the format of the
(Mitchell and Lapata, 2008) dataset used for the first
experiment, with the exception that the target and
landmark verbs are transitive, and an object noun
is provided in addition to the subject noun, hence
forming a small transitive sentence. The dataset
comprises 200 entries consisting of sentence pairs
(hence a total of 400 sentences) constructed by following
the procedure outlined in §4 of (Mitchell and
Lapata, 2008), using transitive verbs from CELEX [7].
For examples of these sentences, see Table 4. The
dataset was split into four sections of 100 entries
each, with guaranteed 50% exclusive overlap with
exactly two other sections. Each section was given
to a group of evaluators, 25 in total, who were
asked to form simple transitive sentence pairs from
the verbs, subject and object provided in each entry;
for instance 'the table showed the result' from 'table
show result'. The evaluators were then asked to rate
the semantic similarity of each verb pair within the
context of those sentences, and offer a score between
1 and 7 for each entry. Each entry was given an arbitrary
classification of HIGH or LOW by the authors,
for the purpose of calculating mean high/low scores
for each model. For example, the first two pairs in
Table 4 were classified as HIGH, whereas the second
two pairs as LOW.

[6] http://www.cs.ox.ac.uk/activities/CompDistMeaning/GS2011data.txt
[7] http://celex.mpi.nl/



Sentence 1            Sentence 2
table show result     table express result
map show location     map picture location
table show result     table picture result
map show location     map express location

Table 4: Example entries from the transitive dataset without
annotator score, second experiment.



Evaluation Methodology The evaluation methodology
for the second experiment was identical to that of the
first, as are the scales for means and scores. Here also,
Spearman's ρ is deemed a more rigorous way of
determining how well a model tracks difference in
meaning. There are two reasons for this: the
classification of verb pairs as HIGH or LOW is imprecise,
and the objective similarity scores produced by a model
that distinguishes sentences of different meaning from
those of similar meaning can be renormalised in
practice. Therefore the difference between the HIGH and
LOW means cannot serve as a definite indication of the
practical applicability (or lack thereof) of semantic
models; the means are provided only to aid comparison
with the results of the first experiment.
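
The renormalisation argument can be made concrete with a small sketch: a monotone rescaling of a model's scores changes the gap between the HIGH and LOW means but leaves Spearman's ρ untouched, since ρ depends only on ranks. The four scores and ratings below are invented for illustration and are not taken from either dataset; the helper names are likewise hypothetical.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def spearman_rho(xs, ys):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


def mean_gap(scores, labels):
    """Difference between the mean score of HIGH items and of LOW items."""
    high = [s for s, l in zip(scores, labels) if l == "HIGH"]
    low = [s for s, l in zip(scores, labels) if l == "LOW"]
    return sum(high) / len(high) - sum(low) / len(low)


# Invented toy data: human ratings (1-7), HIGH/LOW labels, model similarities.
human = [6, 5, 2, 1]
labels = ["HIGH", "HIGH", "LOW", "LOW"]
model = [0.74, 0.73, 0.72, 0.71]

# Min-max rescaling is monotone, so it preserves the ranking of the scores.
lo, hi = min(model), max(model)
rescaled = [(s - lo) / (hi - lo) for s in model]

print(mean_gap(model, labels), spearman_rho(model, human))
print(mean_gap(rescaled, labels), spearman_rho(rescaled, human))
```

The two printed gaps differ by more than an order of magnitude, while the two correlation values coincide.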

Model Parameters As in the first experiment, the
lexical vectors from Mitchell and Lapata (2008) were
used for the other models evaluated (additive,
multiplicative and baseline) 8 and for the noun vectors
of our categorical model. Transitive verb vectors were
trained as described in §4 with S = N ⊗ N.

8 Kintsch was not evaluated as it required optimising model
parameters against a held-out segment of the test set, and we
could not replicate the methodology of Mitchell and Lapata
(2008) with full confidence.
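
To illustrate what composing in S = N ⊗ N involves, here is a minimal sketch. It assumes that the construction of §4 amounts to (i) building a transitive verb as the sum of outer products of the subject and object vectors it was observed with, and (ii) obtaining a sentence by pointwise multiplication of that verb matrix with the outer product of the actual subject and object. The three-dimensional noun vectors, the argument lists and all function names are invented for the example and are not the BNC-derived data used in the experiments.

```python
import math


def outer(u, v):
    """Outer product of two vectors, as a list of rows."""
    return [[a * b for b in v] for a in u]


def learn_verb_matrix(arg_pairs, dim):
    """Sum the outer products of the (subject, object) pairs seen with a verb."""
    verb = [[0.0] * dim for _ in range(dim)]
    for subj, obj in arg_pairs:
        for i in range(dim):
            for j in range(dim):
                verb[i][j] += subj[i] * obj[j]
    return verb


def transitive_sentence(verb, subj, obj):
    """Pointwise product of the verb matrix with subj (x) obj; lives in N (x) N."""
    so = outer(subj, obj)
    return [[verb[i][j] * so[i][j] for j in range(len(obj))]
            for i in range(len(subj))]


def cosine(a, b):
    """Cosine similarity of two matrices, treated as flattened vectors."""
    fa = [x for row in a for x in row]
    fb = [x for row in b for x in row]
    dot = sum(x * y for x, y in zip(fa, fb))
    norm = math.sqrt(sum(x * x for x in fa)) * math.sqrt(sum(y * y for y in fb))
    return dot / norm


# Invented toy noun vectors (dimension 3).
table, result = [1.0, 0.0, 2.0], [0.0, 1.0, 1.0]
map_, location = [2.0, 1.0, 0.0], [1.0, 0.0, 1.0]

# Hypothetical observed arguments for each verb.
show = learn_verb_matrix([(table, result), (map_, location)], dim=3)
express = learn_verb_matrix([(table, result)], dim=3)

s1 = transitive_sentence(show, table, result)      # 'table show result'
s2 = transitive_sentence(express, table, result)   # 'table express result'
print(cosine(s1, s2))
```

Because the sentence is a matrix rather than a vector, swapping subject and object generally changes the result, which is the syntactic sensitivity that commutative operations such as addition and pointwise vector multiplication lack.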

Second Experiment Results The results for the
models evaluated against the second dataset are
presented in Table 5.

Model          High    Low     ρ

Baseline       0.47    0.44    0.16
Add            0.90    0.90    0.05
Multiply       0.67    0.59    0.17
Categorical    0.73    0.72    0.21
UpperBound     4.80    2.49    0.62

Table 5: Selected model means for High and Low similarity
items and correlation coefficients with human judgements,
second experiment, p < 0.05 for each ρ.

We observe a significant (according to p < 0.05)
improvement in the alignment of our categorical
model with the human judgements, from 0.17 to
0.21. The additive model continues to make little
distinction between senses of the verb during
composition, and the multiplicative model's alignment
does not change, but becomes statistically
indistinguishable from the non-compositional baseline
model.

Once again we note that the high-low means are
not very indicative of model performance, as the
difference between the high mean and the low mean of
the categorical model is much smaller than that of
both the baseline model and the multiplicative model,
despite better alignment with annotator judgements.
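
The mismatch between the two measures can be checked directly from the figures reported in Table 5. The short sketch below computes the High minus Low gap for each model and compares the ordering it induces with the ordering by ρ; UpperBound is left out because its means are on the annotators' 1 to 7 scale rather than the models' similarity scale. Variable names are illustrative only.

```python
# (High mean, Low mean, rho) per model, copied from Table 5.
table5 = {
    "Baseline": (0.47, 0.44, 0.16),
    "Add": (0.90, 0.90, 0.05),
    "Multiply": (0.67, 0.59, 0.17),
    "Categorical": (0.73, 0.72, 0.21),
}

# Gap between the High and Low means, rounded to the table's precision.
gaps = {model: round(high - low, 2) for model, (high, low, _) in table5.items()}

by_gap = sorted(table5, key=lambda m: gaps[m], reverse=True)
by_rho = sorted(table5, key=lambda m: table5[m][2], reverse=True)

print(gaps)
print("ordered by gap:", by_gap)
print("ordered by rho:", by_rho)
```

The categorical model comes third when ordered by gap but first when ordered by ρ, which is why the means are reported only for comparison.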

7 Discussion 

In this paper, we described an implementation of the
categorical model of meaning (Coecke et al., 2010),
which combines the formal logical and the empirical
distributional frameworks into a unified semantic
model. The implementation is based on building
matrices for words with relational types (adjectives,
verbs), and vectors for words with atomic types
(nouns), based on data from the BNC. We then showed
how to apply verbs to their subject/object in order to
compute the meaning of intransitive and transitive
sentences.

(2008) with full confidence. 



Other work uses matrices to model meaning (Baroni
and Zamparelli, 2010; Guevara, 2010), but only
for adjective-noun phrases. Our approach easily
applies to such compositions, as well as to sentences
containing combinations of adjectives, nouns, verbs,
and adverbs. The other key difference is that they
learn their matrices in a top-down fashion, i.e. by
regression from the composite adjective-noun context
vectors, whereas our model is bottom-up: it learns
sentence/phrase meaning compositionally from the
vectors of the components of the composites. Finally,
very similar functions, for example a verb with
argument alternations such as 'break' in 'Y breaks'
and 'X breaks Y', are not treated as unrelated. The
matrix of the intransitive 'break' uses the
corpus-observed information about the subject of
break, including that of 'Y'; similarly, the matrix of
the transitive 'break' uses information about its
subject and object, including that of 'X' and 'Y'. We
leave a thorough study of these phenomena, which fall
under providing a modular representation of
passive-active similarities, to future work.

We evaluated our model in two ways: first against 
the word disambiguation task of Mitchell and Lap- 
ata (2008) for intransitive verbs, and then against a 
similar new experiment for transitive verbs, which 
we developed. 

Our findings in the first experiment show that
the categorical method performs on par with the
leading existing approaches. This should not
surprise us, given that the context is so small that
our method becomes similar to the multiplicative model
of Mitchell and Lapata (2008). However, our approach
is sensitive to grammatical structure, which led us to
develop a second experiment that takes this into
account and differentiates our model from models with
commutative composition operations.

The second experiment's results deliver the ex- 
pected qualitative difference between models, with 
our categorical model outperforming the others and 
showing an increase in alignment with human judge- 
ments in correlation with the increase in sentence 
complexity. We use this second evaluation princi- 
pally to show that there is a strong case for the devel- 
opment of more complex experiments measuring not 
only the disambiguating qualities of compositional 
models, but also their syntactic sensitivity, which is 
not directly measured in the existing experiments. 



These results show that the high-level categorical
distributional model, uniting empirical data with
logical form, can be implemented just like any other
concrete model. Furthermore, it shows better results
in experiments involving higher syntactic complexity.
This is just the tip of the iceberg: the mathematics
underlying the implementation ensures that it
uniformly scales to larger, more complicated sentences
and enables it to compare the synonymy of sentences
that are of different grammatical structure.

8 Future Work 

Treatment of function words such as 'that' and 'who',
as well as logical words such as quantifiers and
conjunctives, is left to future work. This will build
on the general guidelines of Coecke et al. (2010)
and concrete insights from the work of Widdows (2005).
It is not yet entirely clear how existing
set-theoretic approaches, for example that of
discourse representation and generalised quantifiers,
apply to our setting. Preliminary work on integration
of the two has been presented by Preller (2007) and
more recently also by Preller and Sadrzadeh (2009).

As mentioned by one of the reviewers, our pregroup
approach to grammar flattens the sentence
representation, in that the verb is applied to its
subject and object at the same time; whereas in other
approaches such as CCG, it is first applied to the
object to produce a verb phrase, then applied to the
subject to produce the sentence. The advantages and
disadvantages of this method and comparisons with
other systems, in particular CCG, constitute ongoing
work.